Abstract

Precise classification of a tumor is imperative for applying the best possible therapy to the individual
patient and for predicting clinical outcome. The standard treatment of patients with
Dukes' C colon cancer includes post-surgical chemotherapy, whereas Dukes' B patients receive no
chemotherapy. Patients with CRC tumors exhibiting microsatellite instability (MSI) have a better
prognosis than patients with microsatellite stable (MSS) tumors, but recent research showed
that only patients with MSS tumors benefit from chemotherapy. It is therefore of clinical relevance to
identify patients with MSS tumors for chemotherapy independent of tumor stage.
The aim of this study was to build a robust classifier based on gene expression to separate MSS
from MSI tumors. Robustness was achieved by collecting the tumors from 14 different clinics in
two different countries, isolating RNA with two methods at three different sites, and labeling RNA
in 10 separate batches. DNA microarray analysis was performed on 38 Danish and 64 Finnish
tumors from primary CRC patients and on 17 normal samples. Unsupervised hierarchical clustering
analysis identified microsatellite instability as the main clinical feature separating the samples into
groups. In addition, we found a weaker but clear separation according to the country of origin of the
samples. Permutation analyses demonstrated that both the microsatellite instability status and the
country of origin of the samples were highly significant signatures in the expression data. Removal
of country-specific genes improved the separation of MSS from MSI in unsupervised classification,
as demonstrated by multidimensional scaling.

Supervised classification of the 102 tumor samples using a maximum likelihood classifier with a
crossvalidation loop resulted in 7 MSI samples being classified as MSS and 1 MSS sample being
classified as MSI. Re-evaluation of the misclassified MSI tumors, including IHC and the expression
levels of specific genes, indicated that 6 of these tumors were probably truly MSS. One MSI
tumor was a signet ring cell carcinoma with a low tumor fraction. Based on this, we excluded the
misclassified tumors, rebuilt the classifier, and tested its performance using 10 or 100
genes. All 94 tumor samples were classified correctly into MSS and MSI.

A classifier should be tested with an independent test set to prove its strength and robustness.
The large difference between MSS and MSI tumors and the large number of tumors allowed us to
separate our dataset into a set for training the classifier and an independent set for testing the
classifier. We selected 25 MSI and 30 MSS samples for selection of optimal classification genes
and used the remaining tumors as an independent test set for evaluating the classification
performance of these genes. Since the performance of a classifier may depend on the tumors
assigned to the training set, we tested the classifier by permutation analysis. The final classifier was
based on 8 genes and classified MSS and MSI tumors with 98.2% precision. The MSS tumors were
identified with a sensitivity of 99.8% and a specificity of 93.8%.
Introduction

Colorectal cancer (CRC) is the fourth most frequently diagnosed malignancy and the second most common cause
of cancer death in the western world. Extensive investigation within the past decade has pointed
towards two alternative genetic pathways in the development of this cancer: the mutator phenotype,
featuring tumors with microsatellite instability (MSI), and the suppressor pathway, represented by
chromosomally unstable but microsatellite stable (MSS) tumors. The majority of tumors in the most common
hereditary CRC syndrome, HNPCC, belong to the group of MSI tumors.

MSI has been defined as a change of any length, due to either insertions or deletions of repeating
units, in a microsatellite within a tumor compared to normal tissue, and it is caused by an underlying
defect in the mismatch repair (MMR) system (Boland et al., CR 1998, 58:5248). A compromised
MMR system commonly affects genes that include or are linked to microsatellite repeat regions, such
as TGFβRII, IGFIIR, E2F-4 and BAX (Markowitz et al., 1995), genes that are rarely mutated in MSS
tumors. Furthermore, MSI tumors are diploid and show no loss of heterozygosity (LOH), whereas MSS
tumors demonstrate a wide variety in chromosomal number and extensive LOH.
The MSI pathway may be either sporadic or hereditary (HNPCC). Whereas the disruption of the
MMR system in sporadic MSI tumors is most often caused by somatic methylation of the MLH1
promoter, more than 90% of HNPCC cancers are caused by germline mutations in MLH1 or MSH2.
The MSS pathway to cancer begins with the inactivation of tumor suppressor genes, such as the
APC/β-catenin genes, followed by activation of oncogenes and inactivation of additional tumor
suppressor genes, commonly with a high frequency of allelic losses, cytogenetic abnormalities,
and abnormal tumor DNA content.
Crude survival data suggest that patients with HNPCC have a better prognosis than those with
sporadic disease, and studies have also shown that MSI is an independent indicator of good
prognosis. A large recent study has shown that patients with MSS tumors benefit from 5-FU/leucovorin
treatment (Ribic et al., 2003), in contrast to MSI cancer patients, who gained no advantage in survival. This is the
exact opposite conclusion from earlier studies, which however used probe collection.
The recognition of different homogeneous groups of CRC with different pharmacological profiles is
mandatory for designing the best therapy for the individual patient. Many studies have
defined the pathoclinical traits of MSI and MSS tumors. MSI-positive cancers are most frequently found
in the right side of the colon, tend to be less differentiated and larger in size, are
often mucinous, and often exhibit extensive infiltration by lymphocytes.

Other studies have addressed the classification of tumors either into Dukes' stages B and C
(Frederiksen and Orntoft, 2003) or into different levels of microsatellite instability (Mori et al., 2003).
Computational scientists in collaboration with medical scientists today readily download datasets
from the Internet and combine these even across different platforms. It is well known that noise and
disparities in experimental protocols set strong limits to this form of data integration. Platform
biases may originate from different probes (i.e. cDNA probes versus oligonucleotides), labelling
procedures, quality, etc. But even studies conducted in one laboratory using a single platform
run the risk of serious biases. Often the samples used have been collected in different clinics
with differing procedures. The time from resection of a tumor to preservation can change the
expression of a number of genes as a response to ischemia (Huang et al., 2001), the amount of normal
tissue in the tumor may vary widely, and information on the location and type of the tumor may not
be available. This may lead to more or less systematic errors, e.g. samples clustering according
to batch of labeling (Mori et al., 2003) or to the procedure for sampling or trimming of the tumor tissue.



Classification of Microsatellite Instable Colorectal Cancer 
Materials and Methods 

Biological material. From the Danish and Finnish CRC tissue banks, 102 primary colorectal cancers
and 17 macroscopically normal colon epithelium samples from the oral resection edge were chosen.
Only adenocarcinomas of Dukes' stage B and C were included; however, these represented a
broad spectrum of tumors in relation to location, heredity, microsatellite instability status, and
origin of the patient. All tumors were collected in the period from 1994 to 2002. 75 tumor samples
were collected at nine different clinics in Finland and 47 samples were collected at four different
clinics in Denmark. 37 were Dukes' B and 65 Dukes' C; 25 were sporadic microsatellite highly instable
(MSI-H), 17 HNPCC and MSI-H, and 59 were sporadic microsatellite stable (MSS) (Table 1). None
of the patients received pre-operative radiation or chemotherapy.

Microsatellite analysis. From all tumor samples available as paraffin blocks, ten sections were cut
at 10 µm and stained with haematoxylin. The first and last sections were cut at 4 µm, stained with
haematoxylin, and routinely mounted. These two sections were used for the identification of tumor
and normal cells in each sample. Regions enriched in tumor cells (more than 90%) were
microdissected from these sections and DNA was extracted using a Puregene DNA extraction kit
(Gentra Systems, Minneapolis, MN). DNA from blood samples was used as control when available;
otherwise normal tissue was microdissected from the tissue sections. The samples were analyzed for
microsatellite instability according to the NCI guidelines (Boland et al.) using markers BAT25 and
BAT26 as previously described (Loukola et al., 2001). Some of the Danish samples were difficult to
analyze, and no definitive result could be obtained.
RNA purification. Colorectal specimens were obtained fresh from surgery and were immediately
snap frozen in liquid nitrogen either as is, in OCT compound, or in an SDS/guanidinium
thiocyanate solution. Total RNA was isolated using RNAzol (WAK-Chemie Medical) or spin
column technology (Sigma) according to the manufacturer's instructions.

Preparation of labelled aRNA target. Ten µg of total RNA was used as starting material for the
target preparation as described (Dyrskjøt et al., 2003). Briefly, first and second strand cDNA
synthesis was performed using the Superscript II System (Invitrogen) according to the
manufacturer's instructions, except that an oligo-dT primer containing a T7 RNA polymerase
promoter site was used. Labelled aRNA was prepared using the BioArray High Yield RNA Transcript
Labelling Kit (Enzo). Biotin-labelled CTP and UTP (Enzo) were used in the reaction together with
unlabelled NTPs. Following the IVT reaction, the unincorporated nucleotides were removed using
RNeasy columns (Qiagen).

Array hybridization and scanning. These procedures were performed as described in detail
elsewhere (Dyrskjøt et al., 2003). Briefly, 15 µg of cRNA was fragmented, loaded onto the
Affymetrix HG_U133A probe array cartridge and hybridized for 16 h. The probe arrays were then
washed and stained in the Affymetrix Fluidics Station and scanned using a confocal laser-scanning
microscope (Hewlett Packard GeneArray Scanner G2500A). The readings from the quantitative
scanning were analyzed by the Affymetrix Gene Expression Analysis Software (MAS 5.0).
Data processing

The arrays were normalized using RMA (robust multi-array average; Irizarry et al., 2003). Redundancy of
probesets, as defined from UniGene build 168, was reduced by removing probesets with high
correlation (>0.5) over all samples.
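
The redundancy filter can be illustrated with a minimal sketch. The text does not say which probeset of a correlated pair is retained, so the rule used here (keep the first one encountered within a UniGene cluster) is an assumption, as are the function names, the probeset labels `p1` to `p4`, and the cluster labels.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def reduce_redundancy(expr, cluster_of, threshold=0.5):
    """Keep a probeset unless it correlates above `threshold` with an
    already-kept probeset from the same UniGene cluster (assumed rule)."""
    kept = []
    for probe in expr:
        redundant = any(
            cluster_of[probe] == cluster_of[k]
            and pearson(expr[probe], expr[k]) > threshold
            for k in kept
        )
        if not redundant:
            kept.append(probe)
    return kept

# Hypothetical toy data: four probesets measured over four samples.
expr = {
    "p1": [1.0, 2.0, 3.0, 4.0],
    "p2": [1.1, 2.1, 2.9, 4.2],  # same cluster as p1, highly correlated
    "p3": [4.0, 1.0, 3.0, 2.0],  # same cluster as p1, not correlated above 0.5
    "p4": [1.0, 2.0, 3.0, 4.0],  # different cluster, so never compared with p1
}
cluster_of = {"p1": "Hs.A", "p2": "Hs.A", "p3": "Hs.A", "p4": "Hs.B"}
kept = reduce_redundancy(expr, cluster_of)
```

Restricting the comparison to probesets of the same cluster keeps distinct genes that happen to be co-expressed.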

Unsupervised agglomerative hierarchical clustering 

For hierarchical expression cluster analysis, 1239 genes with a variation across all samples greater
than 0.5 were median-centred and normalized to a magnitude of 1. Samples and genes were then
clustered using average linkage clustering with a modified Pearson correlation as similarity metric
(Eisen et al., 1998). The cluster dendrogram was visualized with TreeView (Eisen).
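
The preprocessing before clustering could look roughly like the sketch below. Two points are assumptions rather than statements from the text: "variation" is read as the variance across samples, and the "modified Pearson correlation" is read as the uncentered correlation in which the mean is replaced by zero. The function names and toy genes are hypothetical.

```python
import math
from statistics import median, pvariance

def preprocess(rows, min_variation=0.5):
    """Filter genes by variance, median-centre each gene, scale to magnitude 1."""
    out = {}
    for gene, values in rows.items():
        if pvariance(values) <= min_variation:
            continue  # too little variation across samples
        med = median(values)
        centred = [v - med for v in values]
        norm = math.sqrt(sum(c * c for c in centred))
        out[gene] = [c / norm for c in centred]
    return out

def uncentered_correlation(x, y):
    """Correlation with the mean fixed at zero (cosine of the angle between x and y)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

# Hypothetical toy data: g2 varies too little and is filtered out.
rows = {
    "g1": [0.0, 1.0, 2.0, 3.0, 4.0],
    "g2": [5.0, 5.1, 5.0, 4.9, 5.0],
}
processed = preprocess(rows)
```

The resulting similarity values would then be passed to an average linkage clustering routine.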

Group testing 

We make a statistical test in which the p-value is evaluated through permutations. For each group and
gene we calculate the average and the sum of squared deviations from the average. We then sum
these over the genes and the groups:

$$S_1 = \sum_{\text{groups } k} \; \sum_{\text{genes } g} \; \sum_{i \in k} \left( x_{gi} - \bar{x}_{gk} \right)^2$$

where $x_{gi}$ is the expression of gene $g$ in sample $i$ and $\bar{x}_{gk}$ is the average of gene $g$ in group $k$.

This expression is also calculated after joining DK with SF, or MSI with MSS, such that we end up with
two groups; this sum of squared deviations is denoted S2. As a test statistic we use S1/S2. A small
value indicates that there is a real reduction in the deviations when going from 2 to 4 groups and
thus that the groups have a real significance. To judge whether a value is significantly small we use
permutations. For each of the two groups left when joining DK and SF, we randomly allocate the
members to a pseudo-DK and a pseudo-SF group in such a way that the number of members in each group
is as in the original data.

To get an understanding of this separation we performed a test to see whether it is caused by a few
genes or whether many genes are involved. For this test we calculated $S_1 = \sum_{\text{genes}} S_1(\text{gene})$
and similarly $S_2 = \sum_{\text{genes}} S_2(\text{gene})$. For each gene $j$ we used the test statistic
$S_1(j)/S_2(j)$ (Table 2.2).
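
A minimal sketch of the DK-SF group test follows, assuming each sample is a list of expression values with one country label and one microsatellite status label. The function names, the toy data, and the small rounding tolerance in the comparison are assumptions added for illustration; only the S1/S2 statistic and the size-preserving reallocation come from the description above.

```python
import random

def within_group_ss(samples, labels):
    """Sum over groups and genes of squared deviations from the group average."""
    groups = {}
    for vec, lab in zip(samples, labels):
        groups.setdefault(lab, []).append(vec)
    total = 0.0
    for members in groups.values():
        for g in range(len(members[0])):
            mean = sum(m[g] for m in members) / len(members)
            total += sum((m[g] - mean) ** 2 for m in members)
    return total

def s1_over_s2(samples, country, status):
    """Four groups (country x status) versus the two groups left after joining DK and SF."""
    s1 = within_group_ss(samples, list(zip(country, status)))
    s2 = within_group_ss(samples, status)
    return s1 / s2

def permutation_test(samples, country, status, n_perm=100, seed=0):
    rng = random.Random(seed)
    observed = s1_over_s2(samples, country, status)
    smaller = 0
    for _ in range(n_perm):
        pseudo = list(country)
        # Reallocate country labels within each status group, keeping group sizes.
        for s in sorted(set(status)):
            idx = [i for i, st in enumerate(status) if st == s]
            shuffled = [country[i] for i in idx]
            rng.shuffle(shuffled)
            for i, c in zip(idx, shuffled):
                pseudo[i] = c
        # The tolerance guards against rounding when a permutation reproduces the data.
        if s1_over_s2(samples, pseudo, status) < observed * (1 - 1e-9):
            smaller += 1
    return observed, smaller

# Hypothetical toy data: gene 0 carries a strong country effect.
samples = [[5.0, 1.0], [5.2, 1.1], [5.1, 0.0], [4.9, 0.1],
           [0.0, 1.0], [0.2, 0.9], [0.1, 0.0], [-0.1, 0.1]]
country = ["DK"] * 4 + ["SF"] * 4
status = ["MSI", "MSI", "MSS", "MSS"] * 2
observed, smaller = permutation_test(samples, country, status)
```

The MSI-MSS test would swap the roles of the two label lists. The per-gene statistic is obtained by calling the same functions on single-gene vectors.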

Multidimensional scaling

Multidimensional scaling was performed in R and visualized in a two-dimensional plot.

Microsatellite status classifiers

Maximum likelihood classifiers were built as described in Dyrskjøt et al., 2003. For details refer to the
text.

Results 

Hierarchical Clustering 

The clinical specimens used in this study were collected in two different countries from 14 different
clinics in the period 1994 to 2002. The samples were selected to keep a balanced representation of
microsatellite instable (MSI) and microsatellite stable (MSS) tumors from both the right- and left-
sided colon. The MSI class was represented both by sporadic MSI and hereditary MSI (HNPCC)
tumors. Only Dukes' B and Dukes' C tumor samples were included (Table 1). Before any attempt to
divide a diverse sample collection into distinct classes, we analyzed the data for systematic bias that
may have been introduced during the experimental procedures. A fast and easy way to discover both
true distinct classes and systematic biases in the data is to perform a hierarchical clustering.




The phylogenetic tree resulting from hierarchical clustering on 1239 genes (Fig. 1) reveals that the
main separating factor is microsatellite status. On the upper trunk we find two clusters represented
mainly by normal biopsies (14/21) and MSS tumors (18/25), respectively. The lower trunk is
divided into an MSI cluster (30/36) and a second MSS cluster (MSS2-cluster) (34/37). A closer
inspection of the two MSS clusters reveals that one is dominated by Danish samples (19/25) and one
by Finnish samples (26/37). It is also worth noting that the MSI cluster contains a vast
majority of Finnish samples (32/36) and that the sporadic MSI samples are interspersed among the
hereditary samples. The normal biopsies cluster tightly together, with a slight tendency to separate
according to origin. Three normal samples cluster within the MSI cluster, indicating that these
samples may have been resected too close to the tumor lesion.

Inspection of the gene cluster dendrogram shows that the two groups of MSS tumors are mainly
separated by a cluster of approximately 150 genes that are upregulated in the Danish samples (data
not shown), indicating that there is truly a systematic difference between Danish and Finnish
samples.

Difference between Danish and Finnish tumor samples 

Based on these observations and concentrating on the tumor samples, we excluded normal samples
and formed the following four virtual tumor groups: Danish MSI (MSI-DK), Danish MSS (MSS-DK),
Finnish MSI (MSI-SF) and Finnish MSS (MSS-SF). Using 5082 genes with a variance above
0.2, we tested whether all the groups are significant or whether some of the groups can be joined. We considered
the two possibilities of joining DK and SF and of joining MSI and MSS, and made a statistical test
in which the p-value is evaluated through permutations (Table 2). We see that our test value S1/S2 is
smaller in our groups than in all permutations, demonstrating a very clear separation between DK
and SF and also a very clear separation between MSI and MSS. To get an understanding of this
separation we performed a similar test to see whether it is caused by a few genes or whether many genes are
involved. For both DK-SF and MSI-MSS, we observe that many genes cause this effect (Table
3).

When a property is present that influences a large proportion of the genes, and if this influence can
vary from sample to sample, the ordinary normalization procedures can give misleading results. To
analyze this we calculated distances between samples by multidimensional scaling, with and without
re-scaling of the data, and plotted these in a two-dimensional plot (Fig. 2a, b). We find that
re-scaling of the data improves the separation of the groups significantly. Next, we identified and
excluded 816 genes that separate DK from SF with a t-value numerically greater than 2, which led
to a further improvement (Figure 2c). (This plot is not entirely unsupervised, since the groups have
been used to remove genes.) We now see a separation of MSI and MSS with Danish and Finnish
cases mixed. The MSI-DK samples are not completely separated, as they are found both between the
MSI-SF and the MSS samples. At this point we looked at the identity of the genes that were
responsible for the separation of DK from SF. The two genes with the highest fold change were
S100A8 (4.3-fold upregulated) and Hemoglobin B (5.6-fold upregulated). These genes were also
two of the most prominent genes identified as upregulated in tumors as a function of the time before
the samples were frozen after resection (Huang et al., 2001; Yeatman, personal communication).
Thus these genes may represent a response to ischemia and indicate that the sampling procedures in
Denmark and Finland differed in a systematic way.
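
The selection of country-specific genes by t-value can be sketched as follows. The text does not state which t-test was used, so Welch's unequal-variance statistic is an assumption here, and the gene names and values are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples (assumed variant)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def country_genes(expr, is_dk, cutoff=2.0):
    """Genes whose DK-versus-SF t-value is numerically greater than `cutoff`."""
    flagged = []
    for gene, values in expr.items():
        dk = [v for v, d in zip(values, is_dk) if d]
        sf = [v for v, d in zip(values, is_dk) if not d]
        if abs(welch_t(dk, sf)) > cutoff:
            flagged.append(gene)
    return flagged

# Hypothetical toy data: three Danish samples followed by three Finnish samples.
is_dk = [True, True, True, False, False, False]
expr = {
    "geneA": [8.0, 8.5, 9.0, 5.0, 5.5, 6.0],  # clearly higher in DK
    "geneB": [5.0, 6.0, 7.0, 5.5, 6.0, 6.5],  # same mean in both countries
}
flagged = country_genes(expr, is_dk)
```

Genes returned by `country_genes` would be the ones excluded before the next scaling plot or classifier is built.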

Construction of an MSI-MSS classifier. Next, we built a maximum likelihood classifier with a
'leave one out' crossvalidation scheme to classify MSI and MSS tumors. In order to evaluate the
effect of systematic differences between DK and SF, we constructed classifiers based on 24 genes
with and without re-scaling of the data and with and without the 816 DK-SF classifier genes. All
classifiers result in eight errors, and these errors are in all cases the same tumors. The specific genes
used for classification show an overlap of 14-18 of the 24 genes. We see that the classifier works
well and seems to be independent of the removal of genes and of rescaling, which is due to the large
difference between MSI and MSS tumors.

It is noteworthy that seven out of eight errors in the classification of 102 tumor samples were MSI
samples being classified as MSS and that six of these were from Denmark. In order to understand
this, we re-evaluated the clinical data for these tumors (table 2). One tumor turned out to be a signet
ring cell carcinoma with a low ratio of tumor to normal cells, which may make a correct classification
difficult. The remaining six MSI tumors were all left sided and of high to middle grade of
differentiation. All five of these six MSI-DK tumors that were stained for MLH1 and MSH2 in IHC
were positive for both. The single MSS tumor that was misclassified was right sided.
We then looked at the expression levels of a number of genes that have been described in detail in relation
to MSI tumors (Fig. 3). Most of the misclassified MSI tumors had class-aberrant expression
levels for most of the analyzed genes. Thus expression levels of MLH1, TGFβ-induced protein
(TGFBI) and cytokeratin 23 (CK23) were higher compared to MSI, whereas thymidylate synthase
(TYMS) expression was lower. Based on these data, we conclude that the misclassified tumors
are probably MSS but show an aberrant behavior in the microsatellite test.
We now excluded the 816 ischemia genes and the eight outlier samples and rebuilt our classifier.
We decided to let the classifier select 10 or 100 genes, and we chose those genes that were included
in at least 70% of the crossvalidation loops. This resulted in 8 and 96 genes, respectively (Table 4),
and in correct classification of all tumors.
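
The selection of genes that recur in at least 70% of the crossvalidation loops can be illustrated with a small leave-one-out sketch. The classifier used here, independent Gaussians per gene and class, and the gene ranking score are stand-ins chosen for illustration; the actual classifier follows Dyrskjøt et al., 2003 and may differ. All names and data are hypothetical.

```python
import math
from statistics import mean, pvariance

def score(values, labels):
    """Separation of one gene: |difference of class means| over combined spread."""
    a = [v for v, l in zip(values, labels) if l == "MSI"]
    b = [v for v, l in zip(values, labels) if l == "MSS"]
    spread = math.sqrt(pvariance(a) + pvariance(b)) or 1e-9
    return abs(mean(a) - mean(b)) / spread

def select_genes(X, labels, k):
    """Indices of the k best-separating genes in the training samples."""
    ranked = sorted(range(len(X[0])),
                    key=lambda g: score([x[g] for x in X], labels),
                    reverse=True)
    return ranked[:k]

def fit(X, labels, genes):
    """Per-class (mean, variance) for each selected gene."""
    model = {}
    for c in ("MSI", "MSS"):
        rows = [x for x, l in zip(X, labels) if l == c]
        model[c] = [(mean([r[g] for r in rows]),
                     pvariance([r[g] for r in rows]) + 1e-6) for g in genes]
    return model

def predict(model, x, genes):
    """Class with the highest Gaussian log-likelihood."""
    def loglik(c):
        return sum(-0.5 * math.log(2 * math.pi * v) - (x[g] - m) ** 2 / (2 * v)
                   for g, (m, v) in zip(genes, model[c]))
    return max(model, key=loglik)

def loocv(X, labels, k, keep_fraction=0.7):
    """Leave-one-out errors and the genes selected in >= keep_fraction of loops."""
    errors, counts = 0, {}
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        genes = select_genes(train_X, train_y, k)  # selection inside the loop
        for g in genes:
            counts[g] = counts.get(g, 0) + 1
        if predict(fit(train_X, train_y, genes), X[i], genes) != labels[i]:
            errors += 1
    stable = sorted(g for g, n in counts.items() if n >= keep_fraction * len(X))
    return errors, stable

# Hypothetical toy data: only gene 0 separates the classes.
X = [[5.0, 1.0, 0.30], [5.2, 0.2, 0.10], [4.8, 0.6, 0.20],
     [1.0, 0.9, 0.25], [1.2, 0.3, 0.15], [0.8, 0.5, 0.20]]
labels = ["MSI"] * 3 + ["MSS"] * 3
errors, stable = loocv(X, labels, k=1)
```

Selecting genes inside each loop keeps the held-out sample from influencing which genes are used to classify it.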

Training of a classifier and subsequent testing with an independent dataset 
It is commonly accepted that a classifier should be tested with an independent test set. The number
of samples needed for training a classifier depends on the difference between the groups to
be classified. Above we have demonstrated that the difference between MSS and MSI is relatively large
and that the two can be well separated with only eight genes. After the exclusion of the misclassified tumors, we
therefore selected 25 MSI and 30 MSS tumors randomly and allowed the classifier to choose 10
genes to be used for classification. The remaining 10 MSI and 29 MSS tumors were used as an
independent test set. The performance of a classifier may depend on which samples are used
for training and testing. Therefore we made 100 permutations of training and test sets and calculated
the number of errors in crossvalidation of the training set and the number of errors in the
classification of tumors in the test set (Table 5). In the 100 permutations of the training sets we find
a mean of 7.2% errors for MSI (n=25, range 0-4) and 0.13% errors for MSS (n=30, range 0-1). Of the
ten genes, eight were used in at least 70% of the crossvalidation loops and were therefore used for
classification of the test set (Table 5). Using these eight genes, the mean number of errors in the
permuted test sets was 6.8% for MSI (n=10, range 0-3) and 0.17% for MSS (n=29, range 0-3),
resulting in an overall performance of the classifier of 98.2% correct classification (n=39, range 0-3).
In terms of sensitivity and specificity, MSS tumors were classified with a
sensitivity of 98.2% and a specificity of 96.2%.
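
As a sketch of the bookkeeping behind these figures, the following shows one stratified random split of the 94 tumors into training and test sets and the computation of sensitivity and specificity with MSS as the positive class. The function names and the single misclassified tumor in the example are hypothetical; no result of the study is reproduced.

```python
import random

def random_split(labels, n_train, rng):
    """Draw n_train[c] training indices per class c; the rest form the test set."""
    train = []
    for c, n in n_train.items():
        idx = [i for i, l in enumerate(labels) if l == c]
        train += rng.sample(idx, n)
    chosen = set(train)
    test = [i for i in range(len(labels)) if i not in chosen]
    return sorted(train), test

def sensitivity_specificity(true_labels, predicted, positive="MSS"):
    """Sensitivity and specificity for the class named `positive`."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# 35 MSI and 59 MSS tumors remain after exclusion of the misclassified samples.
labels = ["MSI"] * 35 + ["MSS"] * 59
train, test = random_split(labels, {"MSI": 25, "MSS": 30}, random.Random(1))

# Hypothetical test-set outcome: one MSI tumor called MSS.
true_test = ["MSS"] * 29 + ["MSI"] * 10
pred_test = ["MSS"] * 29 + ["MSS"] + ["MSI"] * 9
sens, spec = sensitivity_specificity(true_test, pred_test)
```

Repeating `random_split` with different seeds gives the permutations of training and test sets over which the error rates are averaged.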

Using the 8-gene classifier, classification of tumors from 26 patients with Dukes' B tumors
showed 14 to be MSI and 12 to be MSS. The overall survival was highly significantly related to the
classification, as no individual died in the MSI group whereas 9 out of 12 died in the MSS group
(Figure 4). Thus, the 8-gene classifier clearly proved to be a strong predictor of survival in Dukes' B
patients, and it can be used to select patients who need adjuvant chemotherapy, namely those classified as
MSS.
In the Dukes' C group, 47 were classified as MSS and 13 as MSI. There was no significant difference
in survival between these groups. There was a trend for the MSI group to show poorer survival than the
MSS group, contrary to the Dukes' B patients. This difference may be attributed to the fact that a recent large
study has shown that chemotherapy only benefits the MSS tumor patients, thus improving their
survival to a level comparable to that which is characteristic of MSI tumor patients.
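
The survival curves referred to above are Kaplan-Meier estimates. A minimal sketch of the estimator is given below, assuming one follow-up time per patient and a flag that is true for a death and false for censoring. The function name and the five example patients are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with at least one death."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e)
        removed = sum(1 for tt, e in data if tt == t)
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed  # deaths and censored patients both leave the risk set
        i += removed
    return curve

# Hypothetical follow-up data (time in years, True = death, False = censored).
times = [1, 2, 2, 3, 4]
events = [True, True, False, True, False]
curve = kaplan_meier(times, events)
```

One curve would be computed per predicted class (MSI or MSS) within each Dukes' stage and the curves compared.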
Figures

Figure 1. Phylogenetic tree resulting from unsupervised hierarchical clustering.
Figure 2. Multidimensional scaling plot showing distances between groups of tumors.
Figure 3. Expression level of MSI related genes.

Figure 4. Kaplan-Meier estimates of overall survival among patients with Dukes' B and Dukes' C
colon cancer according to the microsatellite instability status of the tumor.

Table 1. Summary of clinicopathological and microsatellite features of colorectal cancer samples
Table 2. Permutation test of groups
Table 3. Permutation test of genes
Table 4. Genes used for the classification of MSS vs MSI tumors

Table 5. Performance of the classifier
Table 2. Permutation test of groups

Pseudo group   S1/S2 from data   Smaller values in     Minimum in
                                 100 permutations      100 permutations
DK-SF          0.9072795         0                     0.962269
I-S            0.9166195         0                     0.9583325
Table 3. Permutation test of genes

                                        S1(j)/S2(j)
Pseudo group                            <0.6   <0.7   <0.8   <0.9
DK-SF     number of genes               36     136    522    1785
          max in 100 permutations       0      0      2      225
MSI-MSS   number of genes               17     103    399    1507
          max in 100 permutations       0      1      8      250

Table 4. Genes for classification of MSS and MSI tumors

202589_at
210321_at
206976_s_at
208581_x_at
20432B_x_at
209504_s_at
204131_s_at
215780_s_at
212341_at
213738JU*
207993_s_at
201849_at
212859_x_at
204745_x_at
202520_s_at
201762_s_at
216242_s_at
222244_s_at
202762_at
207320_x_at
201976_s_at
215693_x_at
212070_at
214924_s_at
206108_s_at

Symbol      UniGene     Description
AF333388    AF333388    AF333388
AL031602    AL031602    AL031602
DNCL2A      Hs.100002   dynein, cytoplasmic, light polypeptide 2A
CXCL13      Hs.100431   chemokine (C-X-C motif) ligand 13 (B-cell chemoattractant)
NCK2        Hs.101695   NCK adaptor protein 2
GNLY        Hs.105806   granulysin
APOL1       Hs.114309   apolipoprotein L, 1
BST2        Hs.118110   bone marrow stromal cell antigen 2
MT2A        Hs.1187B6   metallothionein 2A
TM4SF8      Hs.121068   transmembrane 4 superfamily member 6
HCA112      Hs.12128    hepatocellular carcinoma-associated antigen 112
BIRC3       Hs.127799   baculoviral IAP repeat-containing 3
CYP2B6      Hs.1360     cytochrome P450, family 2, subfamily B, polypeptide 6
PA26        Hs.14125    p53 regulated PA26 nuclear protein
TRIM44      Hs.14512    tripartite motif-containing 44
riCui       Hs.146612   DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide ™o/f*njA*TBt
GALNT6      Hs 1 SWB    UDP-N-acetyl-alpha-D-galactosamine:polypeptide N-acetylgalactosaminyltransferase 6 (GalNAc-T6)
TNFSF9      Hs.1524     tumor necrosis factor (ligand) superfamily, member 9
APACD       Hs.153884   ATP binding protein associated with cell differentiation
CT120       Hs.154396   membrane protein expressed in epithelial-like lung adenocarcinoma
DATF1       Hs.155313   death associated transcription factor 1
CPNE1       Hs.166887   copine I
MTA1L1      Hs.173043   metastasis-associated 1-like 1
RARRES3     Hs.17466    retinoic acid receptor responder (tazarotene induced) 3
POFUT1      Hs.178292   protein O-fucosyltransferase 1
555SM       HsS
FBXO21      Hs.184227   F-box only protein 21
FLJ20315    Hs.18457    hypothetical protein FLJ20315
EPPK1       Hs.200412   epiplakin 1
PRF1        Hs.2200     perforin 1 (pore forming protein)
CEACAM5     Hs.220529   carcinoembryonic antigen-related cell adhesion molecule 5
CDC14B      Hs.22116    CDC14 cell division cycle 14 homolog B (S. cerevisiae)
ARNTL2      Hs.222024   transcription factor BMAL2
CXCL10      Hs.224B     chemokine (C-X-C motif) ligand 10
FLJ20647    Hs.234149   hypothetical protein FLJ20647
APOL2       Hs.241412   apolipoprotein L, 2
LY6G6D      Hs.241587   lymphocyte antigen 6 complex, locus G6D
HNRPH1      Hs.245710   heterogeneous nuclear ribonucleoprotein H1 (H)
DDAH2       Hs.247362   dimethylarginine dimethylaminohydrolase 2
C20orf35    Hs.2560B6   chromosome 20 open reading frame 35
VPS35       Hs.264190   vacuolar protein sorting 35 (yeast)
G1P3        Hs.265827   interferon, alpha-inducible protein (clone IFI-6-16)
MT1H        Hs.2667     metallothionein 1H
HNRPL       Hs.2730     heterogeneous nuclear ribonucleoprotein L
SCO2        Hs.276431   SCO cytochrome oxidase deficient homolog 2 (yeast)
PLCB4       Hs.283006   phospholipase C, beta 4
MSCP        Hs.283716   mitochondrial solute carrier protein
CHN2        Hs.286055   chimerin (chimaerin) 2
PURA        Hs.29117    purine-rich element binding protein A
^ ESS* lfera ni yme 2 ,cytot 0 x„T- I y m pr 1 oc yl e. a s S o C ,a,e <JsefI ne e s teras e ) )(H »
HSPH1       Hs.36927    heat shock 105kDa/110kDa protein 1
MT1X        Hs.374950   metallothionein 1X
MT1L        Hs.380778   metallothionein 1L
PLEKHB1     Hs.380812   pleckstrin homology domain containing, family B (evectins) member 1
FOXO3A      Hs.380831   forkhead box O3A
Hs.382039   Hs.382039   Homo sapiens, clone IMAGE:4420333, mRNA
N«;40SaB3   Hs.405983   Homo sapiens cDNA FLJ21020 fis, clone CAE06067 mltc „i Q
ATP5A1      Hs.405985   ATP synthase, H+ transporting, mitochondrial F1 complex, alpha subunit, isoform 1, cardiac muscle
CHP         Hs.406234   calcium binding protein P22
UBE2L6      Hs.425777   ubiquitin-conjugating enzyme E2L 6
MT1E        Hs.433206   metallothionein 1E (functional)
MT1G        Hs.433391   metallothionein 1G
MLH1        Hs.433618   mutL homolog 1, colon cancer, nonpolyposis type 2 (E. coli)
PSME2       Hs.433810   proteasome (prosome, macropain) activator subunit 2 (PA28 beta)
CGI-85      Hs.442630   CGI-85 protein
FLJ20618    Hs.52184    hypothetical protein FLJ20618
ROCK2       Hs.58617    Rho-associated, coiled-coil containing protein kinase 2
STAU        Hs.6113     staufen, RNA binding protein (Drosophila)
MYO10       Hs.61638    myosin X
DDX27       Hs.65234    DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 27
GPR56       Hs.6527     G protein-coupled receptor 56
OIP106      Hs.6705     OGT(O-Glc-NAc transferase)-interacting protein 106 KDa
SFRS6       Hs.6891     splicing factor, arginine/serine-rich 6
Table 4 (continued)

Probe set     Symbol    UniGene    Description
204858_s_at   ECGF1     Hs.7394B   endothelial cell growth factor 1 (platelet-derived)
213201_s_at   TNNT1     Hs.73980   troponin T1, skeletal, slow
200814_at     PSME1     Hs.7534B   proteasome (prosome, macropain) activator subunit 1 (PA28 alpha)
20628S_s_at   TDGF1     Hs.75561   teratocarcinoma-derived growth factor 1
204103_at     CCL4      Hs.75703   chemokine (C-C motif) ligand 4
203559_s_at   ABP1      Hs.75741   amiloride binding protein 1 (amine oxidase (copper-containing))
209048_s_at   PRKCBP1   Hs.75871   protein kinase C binding protein 1
202678_at     GTF2A2    Hs.76362   general transcription factor IIA, 2, 12kDa
203915_at     CXCL9     Hs.77367   chemokine (C-X-C motif) ligand 9
201674_s_at   AKAP1     Hs.78921   A kinase (PRKA) anchor protein 1
202203_s_at   AMFR      Hs.80731   autocrine motility factor receptor
203773_x_at   BLVRA     Hs.81029   biliverdin reductase A
208944_at     TGFBR2    Hs.82028   transforming growth factor, beta receptor II (70/80kDa)
200628_s_at   WARS      Hs.82030   tryptophanyl-tRNA synthetase
204780_s_at   TNFRSF6   Hs.82359   tumor necrosis factor receptor superfamily, member 6
220951_s_at   ACF       Hs.8349    apobec-1 complementation factor
221516_s_at   FLJ20232  Hs.83669   hypothetical protein FLJ20232
217875_s_at   TMEPAI    Hs.83883   transmembrane, prostate androgen induced RNA
210029_at     INDO      Hs.840     indoleamine-pyrrole 2,3 dioxygenase
204044_at     QPRT      Hs.8935    quinolinate phosphoribosyltransferase (nicotinate-nucleotide pyrophosphorylase (carboxylating))
218963_s_at   KRT23     Hs.9029    keratin 23 (histone deacetylase inducible)
Table 5. Performance of the classifier

       Training set                 Test set
       Errors in crossvalidation    Test errors
MSI    7.2% (n=25, range 0-4)       6.8% (n=10, range 0-3)
MSS    0.13% (n=30, range 0-1)      0.17% (n=29, range 0-3)
All    3.8% (n=55, range 0-4)       1.8% (n=39, range 0-3)



